A Tool Environment for Efficient Execution of Shared Memory Programs on NUMA Systems
Abstract
One of the most important performance issues on NUMA systems is data locality, since remote memory accesses have latencies several orders of magnitude higher than local memory accesses. This paper presents a tool environment aimed at tuning NUMA-based shared memory applications towards better memory locality. The environment comprises tools, supporting system facilities, and their interface. The tools include a Data Layout Visualizer (DLV), which presents a program's memory access histogram in an easy-to-use form and so enables explicit optimization of applications; a Run-time Adaptive System (RAS), which analyzes low-level memory transactions and redistributes data among processors on the fly; and a SIMulation Tool (SIMT), an execution-driven multithreaded simulator for architecture design and performance evaluation. Experiments on a 4-node NUMA cluster show that optimization can improve speedup by a factor of up to 181 for small codes. Simulation results show that specific architectures can reduce remote memory accesses by as much as 48%.
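The general idea behind histogram-guided data placement can be illustrated with a minimal sketch. It is not the DLV or RAS implementation described in the paper: the trace format (pairs of node and page identifiers), the function names, and the "move each page to its most frequent accessor" policy are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def access_histogram(trace):
    """Count accesses per page, broken down by the issuing node.

    `trace` is an iterable of (node_id, page_id) pairs. This trace format
    is hypothetical, not the one used by the tools in the paper.
    """
    hist = defaultdict(Counter)
    for node, page in trace:
        hist[page][node] += 1
    return hist


def remote_accesses(hist, placement):
    """Total accesses issued by a node other than the page's home node."""
    return sum(
        count
        for page, per_node in hist.items()
        for node, count in per_node.items()
        if node != placement[page]
    )


def relocate(hist, placement):
    """Return a new placement that homes each page on its top accessor.

    A page stays where it is when its current home already ties for the
    highest access count, which avoids pointless migrations.
    """
    new_placement = dict(placement)
    for page, per_node in hist.items():
        best = max(per_node.values())
        if per_node[placement[page]] < best:
            # Break ties deterministically by picking the lowest node id.
            new_placement[page] = min(
                n for n, c in per_node.items() if c == best
            )
    return new_placement


# Hypothetical trace on a 4-node system with three pages.
trace = [(0, "A"), (1, "A"), (1, "A"), (2, "B"), (2, "B"), (0, "B"), (3, "C")]
placement = {"A": 0, "B": 2, "C": 0}

hist = access_histogram(trace)
tuned = relocate(hist, placement)
print(remote_accesses(hist, placement), "->", remote_accesses(hist, tuned))
```

In this toy trace, pages A and C are accessed mostly from nodes other than their home, so relocating them lowers the remote access count, while page B stays put. A real system would also have to weigh the cost of migrating a page against the expected savings.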
Similar resources
Data locality optimization of shared memory programs on NUMA architectures using an integrated tool environment
Due to their excellent price-performance ratio, clusters built from commodity nodes have become broadly adopted and increasingly popular as platforms for parallel processing. Among them, the clusters of standard PCs interconnected with high-speed system area networks (SANs) are especially attractive and have been widely established. At the same time, the developments in interconnection technolo...
Towards Whatever-Scale Abstractions for Data-Driven Parallelism
Increasing diversity in computing systems often requires problems to be solved in quite different ways depending on the workload, data size, and resources available. This diversity is increasingly broad in terms of the organization, communication mechanisms, and performance and cost characteristics of individual machines and clusters. Researchers have thus been motivated to design abstractions ...
Shared Memory Multiprocessor Architectures for Software IP Routers
In this paper, we propose new shared memory multiprocessor architectures and evaluate their performance for future Internet Protocol (IP) routers based on Symmetric Multi-Processor (SMP) and Cache Coherent Non-Uniform Memory Access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the ...
Using Simulation to Understand the Data Layout of Programs
One of the most prominent performance issues on NUMA systems is the access latency to remote memories, which can be several orders of magnitude higher than that of local memory accesses. Effective data allocation that limits the need to access remote memories therefore has the potential to significantly improve the performance of applications. This paper presents a tool that simulates t...
Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures
We experiment with various techniques of monitoring and tuning UPC programs while porting NAS NPB benchmark using the recently developed GCC-SGI UPC compiler on the Origin O3800 NUMA machine. The performance of the NAS NPB on the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. In fact, the SGI NUMA environment has provided new opportunities for UPC. F...